[57] ABSTRACT

The reconfigurable pipelined processor includes a plu- 
rality of memory devices and arithmetic units intercon- 
nected by cross bars for transferring raw and processed 
data therebetween. A counter is connected with the 
cross bar to provide a source of addresses for the mem- 
ory devices. At least one variable tick delay device is 
connected with each memory and arithmetic unit to 
variably control the input and output operations thereof 
to selectively delay the memory devices and arithmetic 
units to align the data for processing in a selected se- 
quence. 

6 Claims, 5 Drawing Sheets 
RECONFIGURABLE PIPELINED PROCESSOR

BACKGROUND OF THE INVENTION

The present invention relates to a method and apparatus providing
reconfigurable pipelined digital processing for use in massively
parallel multi-processing systems. When pipelined calculations are to
be performed, it is often necessary to delay one partial result by some
amount of time before combining it with another partial result. Thus,
for example, in the calculation

    (a_i × b_i) + c_i

the c_i value must be delayed during the computation of a_i × b_i. As
shown schematically in FIG. 1, a_i, b_i, and c_i are results of a
previous part of the pipeline, and the three operands are supplied as
inputs to registers 1, 2, and 3, respectively, on every clock pulse.
The operands a_i and b_i are multiplied in a multiplier 2 with the
result being stored in register 4. In the meantime, c_i is held or
delayed in register 5 during multiplication of a_i × b_i. The product
of a_i × b_i is added to c_i in an adder 4 and the sum is stored in
register 6. FIG. 2 is a table showing the contents of each register for
clock pulses Ti-Tg.

In the design of a pipeline, it is important that there be proper
connections between registers and arithmetic functional units, and that
there be proper timing alignment of data for processing. The
reconfigurable pipelined processor of the present invention provides
both of these features.

BRIEF DESCRIPTION OF THE PRIOR ART

Pipelined processors are well-known in the patented prior art as
evidenced by U.S. Pat. No. 4,525,796. Typically, flexible processors of
the prior art are configured as schematically represented in FIG. 3. A
plurality of memory devices 6, 8, 10, 12, and a plurality of arithmetic
units 14, 16, 18 are interconnected via a cross bar device 20. The
cross bar provides the necessary flexibility, whereby the output of any
device or unit may be connected with the input of any device or unit.
This affords flexibility for non-pipelined applications and for
selected pipelined applications where the delay of intermediary
pipelined data is not required.

For example, for

    c_i = a_i + b_i for i = 1 to n,

then in FIG. 3, let

    MEMORY 1 hold a_i
    MEMORY 2 hold b_i
    MEMORY 3 store results c_i

In operation, the cross bar is controlled to allow the following
connections:

    MEMORY 1 → operand 1 of MULT/DIV 1
    MEMORY 2 → operand 2 of MULT/DIV 1
    output MULT/DIV 1 → MEMORY 3

When it is necessary to configure a more complex pipeline where
multiple partial results must be delayed by different amounts of time,
the architecture of FIG. 3 has serious drawbacks resulting from the
lack of consideration for data timing alignment. Prior flexible designs
make no provisions for storing multiple partial results for variable
numbers of clock periods.

The present invention was developed in order to overcome these and
other drawbacks of prior pipelined processors by providing a
reconfigurable pipelined processor providing variable timing delays to
align the data for processing in a selected sequence. The processor of
the present invention thus provides inherent flexibility for massively
parallel multi-processing systems.

SUMMARY OF THE INVENTION

Accordingly, it is a primary object of the present invention to provide
a method and apparatus for processing data in a reconfigurable
pipelined processor. The processor includes a plurality of memory
devices for storing bits of data, a plurality of arithmetic units for
performing arithmetic functions with the data, and a cross bar device
for connecting the memory devices with the arithmetic units for
transferring data therebetween. Counters are connected with the cross
bar in order to provide a source of addresses for the memory devices.
At least one variable tick delay device is connected with each of the
memory devices and arithmetic units for variably controlling the input
and output operations thereof to selectively delay the memory devices
and arithmetic units to align the data for processing in a selected
sequence.

It is another object of the invention to provide at least one
independent variable tick delay device connected with the cross bar for
re-aligning data during processing in a selected sequence.

According to a more specific object of the invention, each variable
tick delay device includes a plurality of multiplexers each having a
plurality of pipelined registers connected therewith. The number of
registers in the pipelined data path is determined by the control bits
delivered to each multiplexer.

BRIEF DESCRIPTION OF THE FIGURES

Other objects and advantages of the subject invention will become
apparent from a study of the following specification when viewed in the
light of the accompanying drawing, in which:

FIG. 1 is a schematic illustration of a simple pipelined processing
system for performing a simple calculation;

FIG. 2 is a table illustrating the contents of the registers of the
processor of FIG. 1 at various times in the processing cycle;

FIG. 3 is a block diagram of a pipelined processor of the prior art;

FIG. 4 is a block diagram of a reconfigurable pipelined processor
according to the invention;

FIG. 5 is a block diagram of a memory device of the processor of FIG. 4
having variable tick delay devices connected with each port thereof;

FIGS. 6 and 6A are schematic diagrams of a variable tick delay device
and control register therefor, respectively;

FIG. 7 is a table illustrating the relationship between the control
bits and the number of registers required for the variable tick delay
device of FIG. 6; and

FIGS. 8 and 9 are examples of a sparse matrix and a processor
configured for the sparse matrix, respectively, for operation of the
reconfigurable pipelined processor according to the invention.
DETAILED DESCRIPTION

Referring to FIG. 4, the reconfigurable pipelined processor of the
present invention comprises a plurality of memory devices 106, 108,
110, 112 and a plurality of arithmetic units 114, 116, 118, all of
which are interconnected via a cross bar device 120. Each memory device
stores bits of data, whether raw or processed. The arithmetic units
perform arithmetic functions with the data. Thus the unit 114 is an
adder for adding data together while the units 116 and 118 are
combination multipliers and dividers for multiplying and/or dividing
data. While the pipelined processor of FIG. 4 is shown comprising four
memory devices and three arithmetic units, it will be appreciated by
those skilled in the art that any number of memories and arithmetic
units may be provided in accordance with the complexity of the
processing operations being performed.

At least one counter 122 is connected with the cross bar to provide a
source of addresses to the memory devices. As will be developed in
greater detail below, the clock rate or period provides the timing
control of the elements of the processor for data transfer and for
functional operation of each element.

As shown in FIG. 4 and as will be discussed in greater detail below,
each of the memory devices and arithmetic units includes a variable
tick delay (VTD) device. The VTD devices variably control the input and
output operations of the memories and the arithmetic units to
selectively delay these elements in order to align the data for
processing in a selected sequence. Thus while the cross bar provides
connectivity between the elements of the pipelined processor, the VTDs
provide data alignment relative to the clocking period for appropriate
data processing.

As shown in FIG. 5, the VTDs 124 are preferably used as inputs to and
outputs from each memory 126. Similarly, the VTDs are used as inputs to
each arithmetic unit. If desired, VTDs may also be used as outputs from
each arithmetic unit. Also, as shown in FIG. 4, a variable tick delay
device may be provided as an independent element connected directly
with the cross bar for re-aligning data during processing in a selected
sequence.

A sample variable tick delay device is shown in FIGS. 6 and 6A. It
includes a plurality of multiplexers, the number of which is determined
by the number of data bits, and each multiplexer has a plurality of
serialled input registers and output registers connected therewith. In
the example of FIGS. 6 and 6A, the VTD has 64 bits of data input, 64
bits of data output and 32 bits of control (c0-c31). The 64 bits of
data are independently controlled as bytes by means of four control
bits for each byte. The number of control bits required can be
substantially reduced by controlling all 64 bits by four control lines
or by controlling two 32 bit data paths with four bits of control per
path. Regardless of whether the data bits are controlled as bytes, half
words (32 bits) or full words (64 bits), the control bits determine the
number of registers in the pipelined data path as shown in FIG. 7.
Simplified examples of variable tick delay devices are models
L29C520/L29C521 manufactured by Logic Devices Incorporated.
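
The behavior of one VTD lane can be sketched in software as a shift
register whose length is selected by its control setting. The sketch
below is illustrative only and is not the circuit of FIGS. 6 and 6A.
The names VariableTickDelay, pipeline_length and control_bits are
hypothetical, and the mapping T = d + 2 is an assumption inferred from
the settings quoted later in this description (d=0 gives 2T, d=4 gives
6T, d=6 gives 8T); the table of FIG. 7 is not reproduced here.

```python
from collections import deque


def pipeline_length(d: int) -> int:
    """Ticks of delay for control setting d (assumed mapping: T = d + 2)."""
    if d < 0:
        raise ValueError("control setting must be non-negative")
    return d + 2


class VariableTickDelay:
    """Hypothetical model of one VTD lane as a selectable-length shift register."""

    def __init__(self, d: int, fill=None):
        self.length = pipeline_length(d)
        # The deque holds the register contents, oldest value first.
        self.regs = deque([fill] * self.length, maxlen=self.length)

    def tick(self, value):
        """Clock in one value and return the value that entered `length` ticks ago."""
        out = self.regs[0]
        self.regs.append(value)  # maxlen discards the oldest entry
        return out


def control_bits(data_bits: int, lane_bits: int, bits_per_lane: int = 4) -> int:
    """Control bits needed when data_bits are controlled in lanes of lane_bits."""
    return (data_bits // lane_bits) * bits_per_lane


if __name__ == "__main__":
    vtd = VariableTickDelay(d=0)
    print([vtd.tick(v) for v in [1, 2, 3, 4]])
    print(control_bits(64, 8), control_bits(64, 32), control_bits(64, 64))
```

The control_bits helper reproduces the counts given above: byte-wise
control of 64 bits needs 32 control bits, two 32 bit paths need 8, and
a single 64 bit path needs 4.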

In the operation of the reconfigurable pipelined processor of FIG. 4,
input data is loaded into the memory devices and the cross bar device
is operated to selectively transfer the data between the memory devices
and the arithmetic units in accordance with a predetermined clocking
rate. The arithmetic units are operated to perform arithmetic functions
with the data in accordance with the clocking rate. Flexibility, i.e.
reconfiguration, of the processor is accomplished by varying the delay
of the input and output of data from the memory devices and arithmetic
units relative to the clocking rate in order to align the data for
processing in a selected sequence. The processed data is then unloaded
from the memory device.

The operation of the reconfigurable pipelined processor of the
invention is illustrated by way of an example in FIGS. 8 and 9. This
example is for processing a 1×n row vector of real numbers multiplied
by a sparse square matrix (n×n, with many zero entries) of real numbers
to yield a 1×n row vector as the result, as shown in FIG. 8.

Rather than store the many zero entries of the sparse matrix, only the
non-zero entries will be stored along with a tag field containing the i
and j values corresponding with the data's row (i) and column (j)
location in the square matrix.
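
The tagged storage described above can be sketched as a list of
(i, j, value) records. This is a minimal illustration, not the memory
layout of MEM 1; the function name to_tagged_entries, the row-major
ordering and the 0-based tags are assumptions made for the example.

```python
def to_tagged_entries(matrix):
    """Return [(i, j, value), ...] for the non-zero entries, in row-major order.

    i is the row tag and j is the column tag (0-based here by assumption).
    """
    return [
        (i, j, value)
        for i, row in enumerate(matrix)
        for j, value in enumerate(row)
        if value != 0
    ]


if __name__ == "__main__":
    dense = [
        [2, 0, 0],
        [0, 0, 3],
        [0, 4, 0],
    ]
    for entry in to_tagged_entries(dense):
        print(entry)
```

Only three records are kept for this 3×3 matrix instead of nine values.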

In order to concentrate on the unique features of the 
invention, the following simplification will be used in 
describing the operation of the example of FIG. 9: 

(1) assume that the memory devices are large enough 
to store the vectors and the sparse matrix; 

(2) ignore the word size (number of bits) of data and 
tag field (i and j values); and 

(3) ignore cross bar registration delays and let the
symbol ⊗ denote a connection through a cross bar.

The processor configuration of FIG. 9 includes three memory devices.
MEM 1 is assigned the sparse matrix A_ij (data and tags), MEM 2 is
assigned the X vector, and MEM 3 is assigned the Y vector (results).

The symbols d and T shown in the VTD's of FIG. 9 
represent the byte control and pipeline length, respec- 
tively, as defined in FIG. 7 for a VTD. 

MEM 1 is read sequentially as indicated by the counter driving the
address VTD 300 for MEM 1 via the cross bar connection. On every clock
pulse, MEM 1 data and tag fields are loaded into the data output VTD
302 of MEM 1 where the following delays are programmed in the VTD:

(1) tag i, d=0; 2T through VTD 302;

(2) data, d=4; 6T through VTD 302; delayed by 4 to allow the readout of
the X_i quantity to be aligned with the MEM 1 data; and

(3) tag j, d=6; 8T through VTD 302; delayed to line up data coming from
the Y_j memory with X_i × A_ij from the multiplier 304.

The i tag will supply an address through the cross bar to the memory
MEM 2 address VTD 306 which is set to d=0, corresponding to the minimum
of two levels of registration. The memory MEM 2 data is loaded into the
MEM 2 data output VTD 308 which is also set to d=0. The data output
from the VTD 308 drives one operand VTD 310 of multiplier 304 while the
delayed data from the A_ij memory's MEM 1 VTD 302 drives the other
operand VTD 312 of the multiplier 304. Both of the operand VTDs 310 and
312 are set to d=0. The multiplier 304 performs floating point
multiplication using a four-tick (4T) pipelined algorithm having four
levels of registration.
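
The alignment of the two multiplier operands can be checked by adding
up the ticks along each path. The sketch below is a worked tally under
two assumptions that are not stated in the text: that a VTD with
setting d contributes d + 2 ticks (consistent with the settings quoted
above), and that the MEM 2 read itself adds no ticks beyond its VTD
registration (the hypothetical parameter mem2_read_ticks defaults to
0). Cross bar delays are ignored, as in simplification (3).

```python
def vtd_ticks(d: int) -> int:
    """Ticks through a VTD with control setting d (assumed mapping: d + 2)."""
    return d + 2


def multiplier_operand_arrivals(mem2_read_ticks: int = 0):
    """Return (x_ticks, a_ticks): ticks from the MEM 1 output to each operand.

    x_ticks follows tag i through VTD 302, address VTD 306, the MEM 2 read,
    data output VTD 308 and operand VTD 310.
    a_ticks follows the matrix data through VTD 302 and operand VTD 312.
    """
    x_ticks = (
        vtd_ticks(0)        # tag i through VTD 302 (d=0, 2T)
        + vtd_ticks(0)      # MEM 2 address VTD 306 (d=0)
        + mem2_read_ticks   # assumed extra latency of the MEM 2 read
        + vtd_ticks(0)      # MEM 2 data output VTD 308 (d=0)
        + vtd_ticks(0)      # operand VTD 310 (d=0)
    )
    a_ticks = (
        vtd_ticks(4)        # matrix data through VTD 302 (d=4, 6T)
        + vtd_ticks(0)      # operand VTD 312 (d=0)
    )
    return x_ticks, a_ticks


if __name__ == "__main__":
    print(multiplier_operand_arrivals())
```

Under these assumptions both operands reach the multiplier 304 after
the same number of ticks, which is the purpose of the extra delay of 4
programmed for the data field in VTD 302.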

During the 4T duration of the multiplication, the delayed j tag drives
the address VTD 314 of the memory MEM 3. Since the memory MEM 3 will
provide the storage for the resulting vector Y_j, it must be designed
to run at two times the arithmetic pipeline rate in order to support
both the read for the summation in the adder 316 and the store of the
result. Since the j tag must be saved in order to provide a storage
address after the summation is performed, the address VTD 314 of MEM 3
will drive memory MEM 3 and also VTD 318 via a cross bar. The VTD 318
is set to d=4 (6T) in order to delay the j tag so as to be aligned with
the data output from the adder 316. The data read from the memory MEM 3
drives one operand VTD 320 of the adder 316 while the output of the
multiplier 304 drives the other operand VTD 322 of the adder 316. The
adder 316 performs a floating point addition requiring two ticks (2T).
The sum output from the adder 316 is thus aligned with the stored j tag
and they feed the memory MEM 3 via the data-in VTD 324 and the address
VTD 314, respectively.
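
Setting the timing aside, the computation that the configuration of
FIG. 9 carries out can be sketched functionally. The example below
models only the arithmetic (multiply by X_i, then read, add and write
back Y_j) and none of the VTD delays or the doubled MEM 3 rate. The
function name row_vector_times_sparse and the 0-based (i, j, value)
records are assumptions made for the illustration.

```python
def row_vector_times_sparse(x, entries, n):
    """Compute y = x * A for a row vector x and tagged non-zero entries of A.

    entries is an iterable of (i, j, a_ij) records with 0-based tags.
    """
    y = [0.0] * n
    for i, j, a_ij in entries:
        product = x[i] * a_ij   # role of the multiplier 304
        y[j] = y[j] + product   # role of the adder 316: read Y_j, add, store
    return y


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0]
    entries = [(0, 0, 2.0), (1, 2, 3.0), (2, 1, 4.0), (0, 2, 1.0)]
    print(row_vector_times_sparse(x, entries, 3))
```

Each record contributes one multiplication and one accumulation into
the Y location selected by its j tag, which is why the j tag has to be
kept until the sum is ready to be stored.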

The reconfigurable pipelined processor and method for processing
according to the invention thus enable the pipeline to be reconfigured
as desired with relative ease since the alignment of data is provided
in accordance with proper byte control of the variable tick delay
devices.

While in accordance with the provisions of the patent statutes the
preferred forms and embodiments of the invention have been illustrated
and described, it will be apparent to those skilled in the art that
various changes and modifications may be made without deviating from
the inventive concepts set forth above.

What is claimed is: 

1. A reconfigurable pipelined processor for processing data,
comprising:

(a) a plurality of memory devices for storing bits of data;

(b) a plurality of arithmetic units for performing arithmetic functions
with the data;

(c) cross bar means for connecting said memory devices with said
arithmetic units for transferring data therebetween;

(d) at least one counter connected with said cross bar means for
providing a source of addresses to said memory devices;

(e) at least one variable tick delay device connected with each of said
memory devices and arithmetic units; and

(f) means for providing control bits to said variable tick delay device
for variably controlling the input and output operations thereof to
selectively delay said memory devices and arithmetic units to align the
data for processing in a selected sequence.

2. Apparatus as defined in claim 1, wherein a variable 
tick delay device is connected with each input and 
output of each of said memory devices and arithmetic 
units. 

3. Apparatus as defined in claim 2, and further com- 
prising at least one independent variable tick delay de- 
vice connected with said cross bar means for re-aligning 
data during processing in a selected sequence. 

4. Apparatus as defined in claim 3, wherein said vari- 
able tick delay devices each comprise a plurality of 
multiplexers each having a plurality of pipelined regis- 
ters connected therewith, the number of registers in the 
pipelined data path being determined by the control bits 
delivered to each multiplexer. 

5. A method for processing data in a pipelined proces- 
sor including a plurality of memory devices and arith- 
metic units interconnected by cross bars, comprising the 
steps of 

(a) loading input data into the memory devices; 

(b) selectively operating the cross bars to selectively
transfer the data between the memory devices and
the arithmetic units in accordance with a predetermined
clocking rate;

(c) controlling the operation of the arithmetic units to 
perform arithmetic functions with the data; 

(d) varying the delay of the input and output of data 
from the memory devices and arithmetic units rela- 
tive to the clocking rate to align the data for pro- 
cessing in a selected sequence; and 

(e) unloading the processed data from the memory 
device. 

6. The method as defined in claim 5, and further 
comprising the step of controlling read/write signals to 
the memory devices to facilitate the transfer of data 
relative thereto. 